Sparse Pronunciation Codes for Perceptual Phonetic Information Assessment
Abstract
Speech is a complex signal produced by a highly constrained articulation machinery. Neuro- and psycholinguistic theories assert that speech can be decomposed into molecules of structured atoms. Although the characterization of these atoms is controversial, experiments support the notion of invariant speech codes governing speech production and perception. We exploit deep neural network (DNN) invariant representation learning for probabilistic characterization of the phone attributes, which are defined in terms of the phonological classes and known as the smallest-size perceptual categories. We cast speech perception as a channel for phoneme information transmission via the phone attributes. Structured sparse codes are identified from the phonological probabilities for natural speech pronunciation. We exploit the sparse codes in information transmission analysis for assessment of phoneme pronunciation. Linguists define a single binary phonological code per phoneme. In contrast, probabilistic estimation of the phonological classes enables us to capture large variation in the structures of speech pronunciation. Hence, speech assessment need not be confined to the single expert-knowledge-based mapping between phonemes and phonological classes; it may be extended to multiple data-driven mappings observed in natural speech.

I. PROBABILISTIC PERCEPTION CHANNEL

Phonemes are the set of unit sounds that distinguish one word from another in a particular language. The phoneme classes are denoted by an $L$-dimensional random variable $S$ with categorical distribution $(p_{s_1}, \ldots, p_{s_L})$, where $p_{s_l}$ is the probability of phoneme $s_l$. Each phoneme possesses some phone attributes. The phone attributes are defined in terms of phonological classes that describe properties of sound production such as [vowel], [fricative], [dental] and [labial]. For example, the sound “m” has the following attributes: [anterior], [voice] and [nasal]. The set of $K$ phonological classes is denoted by $Q = \{q_1, \ldots, q_K\}$, where $q_k$ is a discrete random variable taking binary values $\{0, 1\}$, with probability $p(q_k = 1) = p_{q_k}$. Linguists define a binary association between phonemes and phonological classes [1].

The production theory of speech perception relies on detection of the underlying phone attributes and their unique combination to form the higher-level phoneme perception. In practice, however, detection of the phonological classes is not perfect. Hence, the present study relies on probabilistic characterization of the phonological classes as developed in [2].

To perform perceptual assessment, we consider speech perception as a channel having the phonological random variables at the input and the phoneme random variables at the output [3]. Hence, we define $z_t$ as the random variable that takes values in the set of phonological classes $Q = \{q_1, \ldots, q_K\}$, where $t$ indexes the temporal window. The posterior probabilities of all phonological classes $\{p(z_t = q_1 \mid x_t), \ldots, p(z_t = q_K \mid x_t)\}$ are estimated by $K$ deep neural networks (DNNs), each specifically trained to detect one of the classes from the input acoustic feature $x_t$ [2]. Given the speech transcription, the posterior representation of the frames labelled as phoneme $s_l$ yields $p(z_t \mid s_l, x_t)$. Accordingly, the posterior representation of the frames labelled as phonological class $q_k$ yields $p(z_t \mid q_k, x_t)$.

The amount of information transmitted by the phone attribute $q_k$ for perception of phoneme $s_l$ is estimated as the multivariate mutual information, expressed as follows, where $H$ denotes the entropy:

$$I_k \equiv I(q_k, s_l, z_t) = H(q_k, s_l, z_t) - H(q_k, s_l) - H(s_l, z_t) - H(q_k, z_t) + H(s_l) + H(q_k) + H(z_t) \tag{1}$$

To calculate this quantity, the DNN phonological posteriors are used as follows.
If the acoustic frame $x_t$ is the result of the production of phoneme $s_l$, we assume that $p(x_t \mid z_t, s_l) = p(x_t \mid s_l)$; the intuition is that the physical process leading to the production of $x_t$ is guided by $s_l$, and the variable $z_t$ is an abstract notion used to exploit the probabilistic association of the DNN outputs with all phonological classes. Hence, given the physical state $s_l$, the observation $x_t$ is independent of $z_t$ or, by Bayes' theorem, $p(z_t \mid s_l, x_t) = p(z_t \mid s_l)$. Similarly, $p(z_t \mid q_k, x_t) = p(z_t \mid q_k)$ is used for the joint probabilities required to calculate (1) [3].

II. INFORMATION OF PRONUNCIATION CODES

The perception of a trained speaker is sensitive to the structures underlying phone attributes during phoneme pronunciation [4]. To identify these structures, we consider a binary representation of the phonological posteriors obtained via quantization [5]. The permissible structures correspond to the indices of the non-zero components. The active components determine the posture of vocalization. Due to the constraints of the articulation machinery, the binary codes are sparse. Fig. 1 illustrates an example of the structured sparsity underlying phonetic and phonological posteriors. The codes generated by vocalization are highly constrained. Linguists define a unique binary code per phoneme [1]. Probabilistic estimation of the phonological classes enables us to capture large variation in the structures of speech pronunciation.

Figure 2 depicts the pronunciation assessment procedure. We identify all the unique sparsity structures from a large speech corpus and construct a codebook of permissible pronunciations [5, 6]. Speech perception operates on the principle of merging independent evidence based on the sparse pronunciation codes. We define a code associated with phoneme $s_l$ as the set $c_l = \{q_1, \ldots, q_{K_l}\}$ of phonological classes.
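The codebook construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of [5, 6]: the 0.5 threshold, the function names `binarize`, `support` and `build_codebook`, the class ordering and the toy posterior values are all hypothetical.

```python
from collections import Counter

def binarize(posteriors, threshold=0.5):
    """Quantize one frame of K phonological posteriors to a binary code.

    The 0.5 threshold is an assumption; the text only says "quantization".
    """
    return tuple(1 if p >= threshold else 0 for p in posteriors)

def support(code):
    """Indices of the active (non-zero) components, i.e. the sparsity structure."""
    return tuple(i for i, bit in enumerate(code) if bit)

def build_codebook(frames, labels, threshold=0.5):
    """Map each phoneme label to a Counter of the unique binary codes seen for it."""
    codebook = {}
    for posteriors, label in zip(frames, labels):
        code = binarize(posteriors, threshold)
        codebook.setdefault(label, Counter())[code] += 1
    return codebook

# Toy data: K = 4 hypothetical classes ordered [anterior, voice, nasal, vowel].
frames = [
    [0.9, 0.8, 0.7, 0.1],   # frame labelled "m"
    [0.8, 0.9, 0.9, 0.2],   # frame labelled "m"
    [0.9, 0.4, 0.8, 0.1],   # frame labelled "m", weakly voiced
    [0.1, 0.9, 0.1, 0.95],  # frame labelled "a"
]
labels = ["m", "m", "m", "a"]

codebook = build_codebook(frames, labels)
for phoneme, codes in codebook.items():
    for code, count in codes.items():
        print(phoneme, code, "active classes:", support(code), "count:", count)
```

In this toy run the phoneme "m" ends up with two distinct codes, which is the kind of data-driven variation that a single expert-defined code per phoneme cannot represent.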
Following the principle of speech perception as partial recognition of independent phonological cues [7, 8], the probability of phoneme perception is calculated as the independent combination of the constituent phonological class probabilities [3]. The information conveyed by the phonological code for perception of phoneme $s_l$ is calculated as $I_l = \sum_{k=1}^{K_l} I_k$. The difference between the transmitted information calculated for perfect speaking and for distorted pronunciation indicates the level of distortion as well as the major phoneme classes distorted by imperfect pronunciation. An example result is illustrated in Fig. 3; the single expert-knowledge-based pronunciation code is used for this illustration.

ACKNOWLEDGEMENT

We acknowledge Swiss NSF funding: “Parsimonious Hierarchical Automatic Speech Recognition and Query Detection”, grant n. 200020-169398.

[Figure: two panels, “Phonological Posteriors” (phonological index versus frame index) and “Phonetic Posteriors” (phone index versus frame index).]
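Returning to the information measure of Section I and the code-level sum $I_l = \sum_k I_k$ of Section II, the following Python sketch evaluates the entropy expansion of Eq. (1) on a small joint probability table. It assumes the joint distribution over $(q_k, s_l, z_t)$ is already available as a dictionary; how that table is assembled from the DNN posteriors follows [3] and is not reproduced here. The function names and the toy distributions are hypothetical.

```python
from itertools import product
from math import log2

def entropy(joint, axes):
    """Shannon entropy in bits of the marginal of `joint` over the given axes."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[a] for a in axes)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marginal.values() if p > 0)

def co_information(joint):
    """Eq. (1) for a joint pmf given as {(q, s, z): probability}.

    Axis 0 is q_k, axis 1 is s_l, axis 2 is z_t.
    """
    def H(*axes):
        return entropy(joint, axes)
    return (H(0, 1, 2) - H(0, 1) - H(1, 2) - H(0, 2)
            + H(1) + H(0) + H(2))

def code_information(joints):
    """I_l as the sum of I_k over the classes in the code (one joint per class)."""
    return sum(co_information(j) for j in joints)

# Toy case 1: the three variables always agree (fully shared information).
identical = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Toy case 2: three independent fair bits (nothing shared).
independent = {outcome: 0.125 for outcome in product((0, 1), repeat=3)}

print(co_information(identical))
print(co_information(independent))
print(code_information([identical, independent]))
```

For the fully shared case the measure is one bit, and for independent variables it is zero. Note that this quantity can be negative for some distributions (for example when $z_t$ is the exclusive-or of two independent bits), so a practical assessment would need to decide how to interpret negative terms in the sum.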